The specific aims of this lab are:
The unit learning outcomes addressed are:
Your tutor will lead you through a getting-to-know-each-other exercise.
The program/language we will be using to analyse data this semester is called R. We will mostly access R through the IDE[1] RStudio. Both are free to use.
You will need to install (or upgrade to the latest version if you already have them installed):
When working in R, there are some functions and data sets that are always available, but the real strength of R comes from its community of developers who continually improve the set of available features and add additional functionality through an ecosystem of “packages.”
A collection of packages, mostly backed by RStudio, called the tidyverse has attracted a lot of attention in the statistics and data science sphere (Wickham et al. 2019). You can install the entire suite of tidyverse packages using the command
install.packages("tidyverse")
This will install ggplot2 (graphics), dplyr (data manipulation), readr (importing data) and a whole slew of other useful packages (Wickham 2016; Wickham et al. 2018; Wickham, Hester, and Francois 2017). You only need to run install.packages() once. Each time you want to use the packages in a new R session, you need to load them into the environment:
library("tidyverse")
When you load the tidyverse master package using library("tidyverse"), R loads a number of other packages. It also prints a few messages, telling us which packages were loaded (and which versions) and that some functions that were previously available have now been masked by the newly loaded packages. For example, if you want to use the filter() function from the stats package, you now need to use stats::filter().
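To make the masking behaviour concrete, here is a small sketch. The vector x and the moving-average weights are made up for illustration; the point is the package::function() notation, which reaches the intended function regardless of which packages are loaded.

```r
# Made-up example data: the numbers 1 to 5
x <- 1:5

# Explicitly call filter() from the stats package (a linear filter,
# here a centred 3-point moving average). The stats:: prefix works
# whether or not dplyr has masked filter().
moving_avg <- stats::filter(x, rep(1 / 3, 3))
as.numeric(moving_avg)  # first and last values are NA (incomplete window)

# By contrast, dplyr::filter() keeps the rows of a data frame that satisfy
# a condition. This part only runs if dplyr is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  small_cars <- dplyr::filter(mtcars, cyl == 4)
  nrow(small_cars)
}
```

Writing the prefix explicitly is also a handy way to show readers of your code which package a function comes from.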
You could also load each package individually, e.g. library("ggplot2") or library("dplyr").
For an overview of the functionality in the tidyverse see R for Data Science (Wickham and Grolemund 2017).
We’re going to dive in the deep end. We’re going to install a package called palmerpenguins that contains a neat data set for us to experiment with (Horst, Hill, and Gorman 2020).
Let’s install it:
install.packages("palmerpenguins")
If all went well, it’s now installed on your computer (you only need to do this once), but it’s not yet loaded, meaning its functionality is not yet available. To help think about this: installing a package adds it to our library collection, where it sits on the shelf and is not really accessible. To actually use the package we need to take it off the shelf and check it out of the library, which we do using the library() function, i.e. library("palmerpenguins").
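The difference between "installed" and "loaded" can be checked from the console. The sketch below is one possible way to do it; the helper name is_installed() is made up for this example and is not part of any package.

```r
# Hypothetical helper: TRUE if a package is installed.
# requireNamespace() checks for the package without attaching it.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (is_installed("palmerpenguins")) {
  # Take the package "off the shelf"
  library("palmerpenguins")
} else {
  message('Not installed yet: run install.packages("palmerpenguins") once, then try again.')
}

# A package that has been loaded with library() appears on the search path
"package:palmerpenguins" %in% search()
```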
What does this package do? We can see the help page using ? or help()
?palmerpenguins
# help(palmerpenguins)
Most packages bundle up a set of functions and make them available to the user when they are loaded. The palmerpenguins package is a little unusual in that it doesn’t provide any functions, just two data sets, penguins and penguins_raw. We will start with the raw data in penguins_raw. We can find out a bit more about it using the help:
?penguins_raw
When the package is loaded, the data is invisibly available (i.e. it doesn’t show up in the Global Environment) until we use it for the first time. We can get an overview of the structure of the stored data using the glimpse() function from the dplyr package:
glimpse(penguins_raw)
Rows: 344
Columns: 17
$ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0…
$ `Sample Number` <chr> "1", "2", "3", "4", "5", "6", "7", "8"…
$ Species <chr> "Adelie Penguin (Pygoscelis adeliae)",…
$ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island <chr> "Torgersen", "Torgersen", "Torgersen",…
$ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg St…
$ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1"…
$ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
$ `Date Egg` <date> 2007-11-11, 2007-11-11, 2007-11-16, 2…
$ `Culmen Length (mm)` <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9…
$ `Culmen Depth (mm)` <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195,…
$ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625…
$ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMAL…
$ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.6…
$ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA, -25.3242…
$ Comments <chr> "Not enough blood for isotopes.", NA, …
When you use glimpse(), it shows one line for each column in the data frame, with the variable name, as well as what type of variable R thinks the column is. Can you work out what each of them means? Do they all make sense?
chr refers to character (i.e. text) variables, dbl refers to numeric variables (short for double) and date refers to date variables.
Other common data types that you’ll encounter include int (integer variables), dttm (date-time variables) and lgl (logical variables, i.e. TRUE or FALSE).
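As a rough illustration of these types, the sketch below builds one made-up value of each kind in base R and asks R for its class. The list names mirror the abbreviations used by glimpse(); note that base R reports a double as "numeric" and a date-time as "POSIXct".

```r
# One made-up value per type, named after the glimpse() abbreviations
values <- list(
  chr  = "Adelie",
  dbl  = 39.1,
  int  = 181L,                      # the L suffix makes an integer
  lgl  = TRUE,
  date = as.Date("2007-11-11"),
  dttm = as.POSIXct("2007-11-11 09:30:00", tz = "UTC")
)

# class() can return several values, so keep only the first for each
types <- sapply(values, function(v) class(v)[1])
types
```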
The variable Sample Number is a sample identifier, so it’s not really a numeric variable; it’s actually a categorical variable. We can change it to a character vector as follows:
penguins_raw = penguins_raw %>%
  mutate(`Sample Number` = as.character(`Sample Number`))
# alternatively using
# penguins_raw$`Sample Number` = as.character(penguins_raw$`Sample Number`)
Note that %>% is a pipe operator: it pipes the data frame penguins_raw into the first argument of the function mutate(), which then modifies (or creates, if it didn’t already exist) the variable Sample Number. Use glimpse() again to confirm that the change has been applied.
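To see what the pipe is doing, it can help to write the same step with and without it. The sketch below uses a small made-up data frame (toy) in place of penguins_raw, and assumes the dplyr package is installed.

```r
library("dplyr")

# Toy data frame (made-up values) standing in for penguins_raw
toy <- tibble(`Sample Number` = c(1, 2, 3),
              Island = c("Torgersen", "Biscoe", "Dream"))

# Piped form: toy becomes the first argument of mutate()
piped <- toy %>%
  mutate(`Sample Number` = as.character(`Sample Number`))

# Equivalent form without the pipe
nested <- mutate(toy, `Sample Number` = as.character(`Sample Number`))

# Both approaches should give the same result
all.equal(piped, nested)
```

The piped form tends to read more naturally once several steps are chained together, because the steps appear in the order they are carried out.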
Before we go any further, we need to notice that the variable names of penguins_raw do not lend themselves to easy use for coding. Specifically, spaces are tricky to deal with and special characters like parentheses or slashes aren’t great to have in a variable name. We can fix this using one of my favourite packages, the janitor package. If it’s your first time using the janitor package, you need to start with installing it:
install.packages("janitor")
The janitor package has an incredibly useful function called clean_names() that sensibly sanitises column names to make it easier for subsequent analysis.
old_names = colnames(penguins_raw)
penguins = penguins_raw %>%
  janitor::clean_names()
We stored the old names in old_names. Create a new variable called new_names with the clean column names and compare the old names and the “cleaned” names side by side using the bind_cols function from the dplyr package. Discuss the changes that have been made to the column names.
# A tibble: 17 × 2
   original            clean
   <chr>               <chr>
 1 studyName           study_name
 2 Sample Number       sample_number
 3 Species             species
 4 Region              region
 5 Island              island
 6 Stage               stage
 7 Individual ID       individual_id
 8 Clutch Completion   clutch_completion
 9 Date Egg            date_egg
10 Culmen Length (mm)  culmen_length_mm
11 Culmen Depth (mm)   culmen_depth_mm
12 Flipper Length (mm) flipper_length_mm
13 Body Mass (g)       body_mass_g
14 Sex                 sex
15 Delta 15 N (o/oo)   delta_15_n_o_oo
16 Delta 13 C (o/oo)   delta_13_c_o_oo
17 Comments            comments
With its default settings, the janitor::clean_names() function converts camelCase to snake_case, replaces spaces with underscores, makes all text lowercase and removes any special characters (e.g. the parentheses and slashes). If present, it would also convert “%” to “percent.” See the vignette for further details.
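These rules can be tried on a handful of names without needing a data frame. The sketch below assumes the janitor package is installed and uses its make_clean_names() function, which works on a character vector; the three names are taken from the table above.

```r
library("janitor")

# Three of the original column names from penguins_raw
messy <- c("studyName", "Culmen Length (mm)", "Delta 15 N (o/oo)")

# make_clean_names() applies the cleaning rules to a character vector
cleaned <- make_clean_names(messy)
cleaned
# "study_name" "culmen_length_mm" "delta_15_n_o_oo"
```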
Whenever I read a new data set into R, the first thing I do is run janitor::clean_names() on the data frame so that I know all the variable names are in a consistent format. It makes downstream analysis and column referencing much easier.
Let’s visualise some of the data using the ggplot2 package. To make use of the ggplot2 package, you need to install it (you probably already have, it comes when you install the tidyverse) and then load it (you may also have already done this if you loaded the tidyverse above).
Your tutor will walk you through the details. The code below generates a scatter plot of flipper length against body mass and colours the points by species.
library("ggplot2")
penguins %>%
  ggplot() +
  # add the aesthetics
  aes(x = body_mass_g,
      y = flipper_length_mm,
      colour = species) +
  # add a geometry
  geom_point() +
  # tidy up the labels
  labs(x = "Body mass (g)",
       y = "Flipper length (mm)",
       colour = "Species")
Note that the values of the species variable are a bit long; we really only need to keep the first word, so let’s do that using the word() function from the stringr package (also part of the tidyverse). Note that in doing so we overwrite the species column in the penguins data frame using the mutate() function from the dplyr package.
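A minimal sketch of this step is shown here on a small made-up data frame (toy) so that it runs on its own; it assumes the dplyr and stringr packages are installed. The same mutate() call can be applied to the penguins data frame.

```r
library("dplyr")
library("stringr")

# Toy data frame with two species labels in the same style as penguins
toy <- tibble(species = c("Adelie Penguin (Pygoscelis adeliae)",
                          "Gentoo penguin (Pygoscelis papua)"))

# Overwrite species with just its first word
toy <- toy %>%
  mutate(species = word(species, start = 1, end = 1))

toy$species
# "Adelie" "Gentoo"
```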
Now regenerate the plot.
library("ggplot2")
penguins %>%
  ggplot() +
  # add the aesthetics
  aes(x = body_mass_g,
      y = flipper_length_mm,
      colour = species) +
  # add a geometry
  geom_point() +
  # tidy up the labels
  labs(x = "Body mass (g)",
       y = "Flipper length (mm)",
       colour = "Species")
Let’s save that plot as a png file so you can print it out and stick it on the fridge!
ggsave(filename = "myfirstggplot.png")
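By default ggsave() saves the last plot that was displayed. If you would rather be explicit about which plot is saved and how large it is, something along the following lines may help. The data frame toy, the object name p, the output location and the dimensions are all made up for this sketch.

```r
library("ggplot2")

# Tiny made-up data set so the example runs on its own
toy <- data.frame(body_mass_g = c(3750, 3800, 5000),
                  flipper_length_mm = c(181, 186, 210))

# Store the plot in an object
p <- ggplot(toy) +
  aes(x = body_mass_g, y = flipper_length_mm) +
  geom_point()

# Passing plot = explicitly avoids relying on "the last plot displayed";
# width and height are in inches by default
out_file <- file.path(tempdir(), "myfirstggplot.png")
ggsave(filename = out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```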
The ggplot2 cheat sheet is a great, concise resource to find out some of what’s possible. You can also access this from within RStudio by clicking the menu item Help > Cheatsheets.
facet_wrap() to generate a plot for each species and island combination.
geom_smooth(method = "lm").
geom_histogram() to create a histogram for flipper length, coloured by species.
ggsave().
write_csv() function which can be found in the readr package.
penguins %>%
  drop_na(sex) %>%
  ggplot() +
  aes(x = culmen_length_mm,
      y = flipper_length_mm,
      colour = sex) +
  geom_point(alpha = 0.5) +
  facet_wrap(vars(species, island)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Culmen length (mm)",
       y = "Flipper length (mm)",
       colour = "Sex") +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")
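The write_csv() item listed above can be sketched as follows. This assumes the readr package is installed; the data frame toy and the file name are made up, and in practice you would pass the penguins data frame and a path of your choosing.

```r
library("readr")

# Made-up data standing in for the cleaned penguins data frame
toy <- data.frame(species = c("Adelie", "Gentoo"),
                  body_mass_g = c(3750, 5000))

# Write the data to a CSV file in a temporary folder
csv_path <- file.path(tempdir(), "penguins_clean.csv")
write_csv(toy, csv_path)

# Read it back in to check that the export worked
read_back <- read_csv(csv_path, show_col_types = FALSE)
read_back
```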
Just for fun, let’s make the scatter plot of flipper length against body mass interactive using the plotly package (Sievert et al. 2017).
# install.packages("plotly")
library("plotly")
myplot = penguins %>%
  ggplot() +
  # add the aesthetics
  aes(x = body_mass_g,
      y = flipper_length_mm,
      colour = species) +
  # add a geometry
  geom_point() +
  # tidy up the labels
  labs(x = "Body mass (g)",
       y = "Flipper length (mm)",
       colour = "Species")
plotly::ggplotly(myplot)
Here we have created an “object” in R called myplot which stores the ggplot. You can print the standard ggplot by typing myplot in the console and pressing Enter. We can also do other things with the object, such as add additional themes to it, e.g. myplot + theme_linedraw() or as is done above, we can feed it as an input into the function ggplotly from the plotly package.
Note that if you are implementing this in an R Markdown document (see below), the interactive graphics will only work if you output an HTML document; they won’t work for PDF or Word outputs.
Markdown is a lightweight markup language (in the same way that HTML is a markup language). One of the big advantages of Markdown as a language is its simplicity: it forces you to focus on content rather than play with styling. R Markdown is a great way to do reproducible research and generate reports. You can compile (or knit) R Markdown documents into a variety of formats, including HTML, Word, PDF, as well as presentations.
For more details on using R Markdown see http://rmarkdown.rstudio.com. A useful guide to help you get started can be found here and there’s a cheat sheet here. There is also a book on R Markdown which has everything you could possibly want to know about R Markdown and a whole lot more (Xie, Allaire, and Grolemund 2018), as well as R Markdown for Scientists, which gives a more concise overview.
Create a new Rmd file (Rmd is the R Markdown file extension). In RStudio: File -> New File -> R Markdown...
There is a Knit button at the top of the source window. Click that button to turn the markdown into HTML (or PDF or Word).
Code chunks start with three back ticks followed by, in curly braces, r, the (optional) chunk name and any arguments: ```{r} or ```{r chunk_name, tidy = TRUE}. The chunk also ends with three back ticks ```. Examples can be seen in the template that opens as a new file in RStudio (you can delete most of the template except the YAML code at the top).
You can embed static plots in an R Markdown document without doing anything special. Important chunk options are fig.width and fig.height to set the figure width and height, for example ```{r, fig.width = 4, fig.height = 6}.
Some useful chunk options:
tidy = TRUE: makes the R code more readable (proper spacing)
results = 'hide': hides the results of the chunk output (i.e. doesn’t show them)
results = 'hold': holds the results of the chunk output until all commands in the chunk have been run
warning = FALSE: doesn’t show any warning messages (e.g. when ggplot2 drops observations)
message = FALSE: doesn’t show any messages (e.g. when packages load)
{r chunkname}: you can name your chunks with text immediately after the r. This can be particularly useful when errors pop up as it makes it easier to identify which chunk the error occurs in.
Take the work you did with the Palmer penguins data and write it up in an R Markdown document. Detail what you did, including the packages and functions you used, in the text for future you. Knit to HTML.
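If you would like to see the pieces of an Rmd file laid out in one place, the sketch below writes a minimal one from the R console. The title, chunk names and file name are made up, and the fence object simply holds the three back ticks that open and close a chunk. In practice you would normally create the file through the RStudio menu rather than with writeLines().

```r
# Three back ticks, built with strrep() so they are easy to spot
fence <- strrep("`", 3)

# Lines of a minimal R Markdown file (title and chunk names are made up)
rmd_lines <- c(
  "---",                                   # YAML header starts
  'title: "Palmer penguins write-up"',
  "output: html_document",
  "---",                                   # YAML header ends
  "",
  "Some text describing what was done.",
  "",
  paste0(fence, "{r setup, message = FALSE, warning = FALSE}"),
  'library("tidyverse")',
  fence,
  "",
  paste0(fence, "{r scatter, fig.width = 4, fig.height = 6}"),
  "plot(cars)",
  fence
)

# Write the file to a temporary folder and print it to the console
rmd_path <- file.path(tempdir(), "penguins_report.Rmd")
writeLines(rmd_lines, rmd_path)
cat(readLines(rmd_path), sep = "\n")

# Knitting from the console is roughly what the Knit button does
# (uncomment if the rmarkdown package and pandoc are available):
# rmarkdown::render(rmd_path)
```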
When you do this, you’ll find that each time you knit your document, it re-runs all your code and loads all the libraries from scratch. This is a) a pain and b) fantastic for reproducibility. It’s a pain because you’ve already done things in the “global” environment (loaded data and packages, generated figures, etc.), and it takes time for things to be re-run. It’s fantastic for reproducibility because it means that everything you do has to be in the Rmd file for the knit to be successful. That is, you’ll need to load all the packages you use in the Rmd file, you’ll need to do all the data manipulation, and include all the plot code in the Rmd file.
Later in the semester you will need to submit an R Markdown report (the first assignment). To help familiarise you with this process, we strongly recommend you try submitting your R Markdown report to the Lab 00 practice submission assignment on Canvas. We want to make sure you’re familiar with the process of uploading an HTML file. You should always double check to make sure it has actually been submitted.
There are no marks associated with Lab 00 practice submission.
What does ? followed by a function or package name do?
What does # do in your R code?
What is the difference between install.packages("palmerpenguins") and library("palmerpenguins")?
Packages used:
dplyr, for the glimpse() and mutate() functions.
[1] Integrated development environment